Abstract: Clustering is one of the most widely researched data mining problems in the text domain. Its applications include classification, document organization, visualization, indexing, and collaborative filtering. A large amount of the information present in documents is in the form of text, but data cannot always be retrieved in a refined text format. Documents often also carry side information, in the form of links in the document, user-access behaviour, document provenance information from web logs, or other non-textual attributes. These attributes may contain a large amount of information that is useful for clustering. However, estimating the relative importance of this information is difficult and cumbersome in most cases, particularly when some of it is noisy. In such situations, integrating side information into the mining process can be risky, because it may either improve the quality of the mining process or add noise to the data. A principled approach is therefore needed to perform the mining process so as to maximize the advantages of using the available side information. In this paper, we propose the use of the K-means algorithm for improved and efficient clustering of such information.
Keywords: Information Retrieval, Text Mining, Clustering, K-means, Side Information.